DROPS

Document

DOI: 10.4230/LIPIcs.WABI.2017.3

Better Greedy Sequence Clustering with Fast Banded Alignment

Authors: Brian Brubach, Jay Ghurye, Mihai Pop, and Aravind Srinivasan

Published in: LIPIcs, Volume 88, 17th International Workshop on Algorithms in Bioinformatics (WABI 2017)

Abstract

Comparing a string to a large set of sequences is a key subroutine in greedy heuristics for clustering genomic data. Clustering 16S rRNA gene sequences into operational taxonomic units (OTUs) is a common method used in studying microbial communities. We present a new approach to greedy clustering using a trie-like data structure and Four Russians speedup. We evaluate the running time of our method in terms of the number of comparisons it makes during clustering and show in experimental results that the number of comparisons grows linearly with the size of the dataset as opposed to the quadratic running time of other methods. We compare the clusters output by our method to the popular greedy clustering tool UCLUST. We show that the clusters we generate can be both tighter and larger.

Cite as

Brian Brubach, Jay Ghurye, Mihai Pop, and Aravind Srinivasan. Better Greedy Sequence Clustering with Fast Banded Alignment. In 17th International Workshop on Algorithms in Bioinformatics (WABI 2017). Leibniz International Proceedings in Informatics (LIPIcs), Volume 88, pp. 3:1-3:13, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2017)

Copy BibTex To Clipboard

@InProceedings{brubach_et_al:LIPIcs.WABI.2017.3,
  author =	{Brubach, Brian and Ghurye, Jay and Pop, Mihai and Srinivasan, Aravind},
  title =	{{Better Greedy Sequence Clustering with Fast Banded Alignment}},
  booktitle =	{17th International Workshop on Algorithms in Bioinformatics (WABI 2017)},
  pages =	{3:1--3:13},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-050-7},
  ISSN =	{1868-8969},
  year =	{2017},
  volume =	{88},
  editor =	{Schwartz, Russell and Reinert, Knut},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops-dev.dagstuhl.de/entities/document/10.4230/LIPIcs.WABI.2017.3},
  URN =		{urn:nbn:de:0030-drops-76425},
  doi =		{10.4230/LIPIcs.WABI.2017.3},
  annote =	{Keywords: Sequence Clustering, Metagenomics, String Algorithms}
}

Document

DOI: 10.4230/LIPIcs.WABI.2017.23

Outlier Detection in BLAST Hits

Authors: Nidhi Shah, Stephen F. Altschul, and Mihai Pop

Published in: LIPIcs, Volume 88, 17th International Workshop on Algorithms in Bioinformatics (WABI 2017)

Abstract

An important task in a metagenomic analysis is the assignment of taxonomic labels to sequences in a sample. Most widely used methods for taxonomy assignment compare a sequence in the sample to a database of known sequences. Many approaches use the best BLAST hit(s) to assign the taxonomic label. However, it is known that the best BLAST hit may not always correspond to the best taxonomic match. An alternative approach involves phylogenetic methods which take into account alignments and a model of evolution in order to more accurately define the taxonomic origin of sequences. The similarity-search based methods typically run faster than phylogenetic methods and work well when the organisms in the sample are well represented in the database. On the other hand, phylogenetic methods have the capability to identify new organisms in a sample but are computationally quite expensive. We propose a two-step approach for metagenomic taxon identification; i.e., use a rapid method that accurately classifies sequences using a reference database (this is a filtering step) and then use a more complex phylogenetic method for the sequences that were unclassified in the previous step. In this work, we explore whether and when using top BLAST hit(s) yields a correct taxonomic label. We develop a method to detect outliers among BLAST hits in order to separate the phylogenetically most closely related matches from matches to sequences from more distantly related organisms. We used modified BILD (Bayesian Integral Log Odds) scores, a multiple-alignment scoring function, to define the outliers within a subset of top BLAST hits and assign taxonomic labels. We compared the accuracy of our method to the RDP classifier and show that our method yields fewer misclassifications while properly classifying organisms that are not present in the database. Finally, we evaluated the use of our method as a pre-processing step before more expensive phylogenetic analyses (in our case TIPP) in the context of real 16S rRNA datasets. Our experiments demonstrate the potential of our method to be a filtering step before using phylogenetic methods.

Cite as

Nidhi Shah, Stephen F. Altschul, and Mihai Pop. Outlier Detection in BLAST Hits. In 17th International Workshop on Algorithms in Bioinformatics (WABI 2017). Leibniz International Proceedings in Informatics (LIPIcs), Volume 88, pp. 23:1-23:11, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2017)

Copy BibTex To Clipboard

@InProceedings{shah_et_al:LIPIcs.WABI.2017.23,
  author =	{Shah, Nidhi and Altschul, Stephen F. and Pop, Mihai},
  title =	{{Outlier Detection in BLAST Hits}},
  booktitle =	{17th International Workshop on Algorithms in Bioinformatics (WABI 2017)},
  pages =	{23:1--23:11},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-050-7},
  ISSN =	{1868-8969},
  year =	{2017},
  volume =	{88},
  editor =	{Schwartz, Russell and Reinert, Knut},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops-dev.dagstuhl.de/entities/document/10.4230/LIPIcs.WABI.2017.23},
  URN =		{urn:nbn:de:0030-drops-76512},
  doi =		{10.4230/LIPIcs.WABI.2017.23},
  annote =	{Keywords: Taxonomy classification, Metagenomics, Sequence alignment, Outlier detection}
}

Document

DOI: 10.4230/DagRep.6.8.91

Next Generation Sequencing (Dagstuhl Seminar 16351)

Authors: Gene Myers, Mihai Pop, Knut Reinert, and Tandy Warnow

Published in: Dagstuhl Reports, Volume 6, Issue 8 (2017)

Abstract

Next Generation Sequencing (NGS) data have begun to appear in many applications that are clinically relevant, such as resequencing of cancer patients, disease-gene discovery and diagnostics for rare diseases, microbiome analyses, and gene expression profiling. The analysis of sequencing data is demanding because of the enormous data volume and the need for fast turnaround time, accuracy, reproducibility, and data security. This Dagstuhl Seminar aimed at a free and deep exchange of ideas and needs between the communities of algorithmicists and theoreticians and practitioners from the biomedical field. It identified several relevant fields such as data structures and algorithms for large data sets, hardware acceleration, new problems in the upcoming age of genomes, etc. which were discussed in breakout groups.

Cite as

Gene Myers, Mihai Pop, Knut Reinert, and Tandy Warnow. Next Generation Sequencing (Dagstuhl Seminar 16351). In Dagstuhl Reports, Volume 6, Issue 8, pp. 91-130, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2017)

Copy BibTex To Clipboard

@Article{myers_et_al:DagRep.6.8.91,
  author =	{Myers, Gene and Pop, Mihai and Reinert, Knut and Warnow, Tandy},
  title =	{{Next Generation Sequencing (Dagstuhl Seminar 16351)}},
  pages =	{91--130},
  journal =	{Dagstuhl Reports},
  ISSN =	{2192-5283},
  year =	{2017},
  volume =	{6},
  number =	{8},
  editor =	{Myers, Gene and Pop, Mihai and Reinert, Knut and Warnow, Tandy},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops-dev.dagstuhl.de/entities/document/10.4230/DagRep.6.8.91},
  URN =		{urn:nbn:de:0030-drops-68395},
  doi =		{10.4230/DagRep.6.8.91},
  annote =	{Keywords: Cancer, DNA Sequence Assembly, Expression Profiles, Next Generation Sequencing, Sequence analysis, Software Engineering (Tools \& Libraries)}
}

Search Results

Documents authored by Pop, Mihai

Better Greedy Sequence Clustering with Fast Banded Alignment

Abstract

Cite as

Outlier Detection in BLAST Hits

Abstract

Cite as

Next Generation Sequencing (Dagstuhl Seminar 16351)

Abstract

Cite as

Thanks for your feedback!

Could not send message